For further information contact: Martin.Irman@mail.mcgill.ca
(if you would like to use the software described below, feel free to contact me)
Java Cluster is a cluster environment written in Java for automating the
execution of Matlab simulations on multiple computers (currently used on
multiple Windows XP computers). It does not facilitate automatic parallel
execution of a Matlab script (that would be nice, wouldn’t it :) but allows the
user to schedule the execution of a single Matlab script for hundreds of runs
automatically on multiple computers. The environment supports the execution of
multiple parallel “jobs” (a “job” is a Matlab script that should be executed x
times), where the executions are scheduled according to job priorities (just,
priority and idle scheduling are possible). The environment further provides
tools to combine the data generated in the multiple executions.
Automatic execution of jobs other than Matlab simulations would be possible
with some modification to the cluster software.
This documentation file contains these parts:
Do not move or rename this document. Use MsWord to edit.
The cluster environment consists of 3 components: JobServer, ArbiterServer and
ClusterClient.
To be able to manage simulations from your computer you only have to run the
ClusterClient. You need to set it up to connect to the computer running the
ArbiterServer (this can be the same computer).
(JobServer, ArbiterServer, ClusterClient)
Things that have to be set up:
(you might try to use the setup scripts in the install_scripts folder, but
these are matched to a specific computer configuration)
1. Review cluster.properties:
* The Matlab path needs to be set
* Comment out the STANDBY property if you don't want your computers to go to
  standby automatically
2. If you have set up your JobServer computers to go to stand-by automatically,
   don't forget to set up Windows so that the computer automatically wakes up
   on network activity
3. Review install.bat. The batch file includes system-specific information.
* Change the account information for the "schtasks" commands
4. Add the java_cluster\matlab folder to the Matlab path!!!
5. Run install.bat to set up automatic start of JobServer and/or ArbiterServer
   on start-up of the computer.
6. You can do this manually without running install.bat
7. If e-mail notification is desired, you have to set up ‘blat’ with e-mail
   account information. ‘blat’ is a command line utility that can send
   e-mails. It is included in the ‘system’ folder (detailed documentation is
   also included). To set it up, run the following command from the command
   line:
blat -install <mail_server> <sender_email> 3 25 cluster <login> <password>
This only needs to be done once. At McGill, use the following to set this
utility up:
blat -install mailhost.mcgill.ca Martin.Irman@mail.mcgill.ca 3 25 cluster <DAS USERNAME> <DAS PASSWORD>
Now, user e-mail addresses have to be added to cluster.properties.
Create a line like the following for every user:
Martin = Martin.Irman@mail.mcgill.ca
If this line is not added to the cluster.properties file for a user, the user
will not receive e-mail notification.
8. (not important) Review eject.bat and load.bat in the \system folder and
   correct the CD-ROM drive letter if you want to be able to command the CD to
   eject
Logging in:
How to run a simulation:
save results\matlab.mat
save results\matlab.mat a_variable b_variable
save results\ber.mat snr ber
Careful! The following does not work (Matlab does not like the leading
backslash):
save \results\matlab.mat (DOES NOT WORK!)
The save commands without the leading backslash save the current workspace (or
the specified variables) in the file matlab.mat (or ber.mat). Later, everything
that is generated inside the results folder is moved back to the client, where
it can be analyzed using the jCombine command.
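The save step can also be written with the functional form of save, which makes the relative path explicit. The following is only a minimal sketch: the values of snr and ber are made up for illustration, and the mkdir guard is an assumption so that the snippet can be tried outside the cluster (it is not known from this document whether the cluster creates the results folder itself).

```matlab
% Made-up example data (hypothetical values, for illustration only)
snr = 0:2:10;
ber = 0.5 * exp(-snr / 2);

% Relative folder name: note that there is NO leading backslash
resultsDir = 'results';

% Assumption: create the folder if it is missing (useful for a local test run)
if ~exist(resultsDir, 'dir')
    mkdir(resultsDir);
end

% Functional form of: save results\ber.mat snr ber
save(fullfile(resultsDir, 'ber.mat'), 'snr', 'ber');
```

Running this from the parent of the results folder writes results\ber.mat containing only the two named variables.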
What is priority:
Controlling the cluster:
Users and package scheduling:
Email notification:
Martin = Martin.Irman@mail.mcgill.ca
jSetCombine(strMatlabFile,strVariable [,strName])
Call this function from a simulation .m file to set which variables should be
combined automatically. Multiple jSetCombine statements can be included in the
simulation script. The command has to be issued while the current directory is
either the results folder or its parent folder.
Sets which variables should be combined if these are not explicitly specified
in jCombine.
Specify a Matlab .mat file (strMatlabFile) and which variable to combine
(strVariable). Optionally specify the name for the combined variable (strName)
in the results. If not specified, strVariable is used; strName needs to be
specified if strVariable uses wildcards, e.g. '*'.
jSetCombine('matlab.mat','SNR');
jSetCombine('matlab.mat','*','all');
Blacklisting computers:
% Blacklist the computer if it does not have enough memory
% (physical memory in MB).
requiredMemory = 200;
if (jSystemMemory < requiredMemory)
    % Construct an error message
    strCause = [' Not enough memory on the system! present: ' ...
        num2str(jSystemMemory) ' < required: ' num2str(requiredMemory)];
    % Blacklist the computer - this function terminates Matlab execution!!!
    jBlacklist(strCause);
end;

% Blacklist a specific computer by name
if (strcmp(jSystemName,'cluster5'))
    jBlacklist;
end;
% This is an example of a script to be used in a cluster simulation

% Blacklist the computer if it does not have enough memory
% (physical memory in MB).
requiredMemory = 600;
if (jSystemMemory < requiredMemory)
    % Construct an error message
    strCause = [' Not enough memory on the system! present: ' ...
        num2str(jSystemMemory) ' < required: ' num2str(requiredMemory)];
    % Blacklist the computer - this function terminates Matlab execution!!!
    jBlacklist(strCause);
end;

% Here would be the code of the simulation we want to execute (let's just
% generate a random matrix in this example).

% Seed the uniform generator used by rand from the clock, so that each run
% produces a different matrix, then generate a random matrix
rand('state',sum(100*clock));
a = rand(2,2);

% Let's wait a few seconds ... as if we were doing something
pause(30);

% Save the workspace in the results folder - only files inside the
% results folder will be retrieved back to the cluster client
save results\matlab.mat

% You can use jCombine('rnd_num','matlab.mat','*') to retrieve the
% saved workspace or just jCombine('rnd_num','matlab.mat','a') to
% retrieve the variable 'a' from all the 'matlab.mat' files.
1. Matlab cannot find jCombine:
* Add the \matlab subdirectory of your cluster data to the Matlab path!
2. Some computers have problems:
* Reboot the problematic computers
* Check if there is free space on the disk where the cluster data is located.
  If the computer is low on disk space, run ClusterClient and use the 'Purge'
  button in the 'Servers' window. This will clean up old data. Check if it
  helped.
3. Some computers are running a job that should have finished long ago but is
   not finishing:
* Matlab does not exit if another instance of Matlab is running on the same
  machine and this instance is unresponsive (running simulations). As soon as
  that other Matlab finishes, or if you kill it manually, the Matlab launched
  by the cluster will also exit and the server will resume normal operation.
4. An error is reported by the cluster client: ‘error saving file’:
* Make sure that you save your data in the results folder using the command
  ‘save results\matlab.mat’ and not ‘save \results\matlab.mat’. Matlab does
  not like the extra backslash.
5. Some computers are not good for running the simulations (they don’t have
   enough RAM and the swapping could damage the disk):
* Blacklist the computers where you don’t want to run the simulation. If a
  script determines that a computer is not good enough for the simulation, it
  can blacklist the computer from further executions of this script by calling
  jBlacklist (refer to the description above).
6. Blacklisting does not work; the computer is listed as blacklisted but the
   simulations start up on it anyway:
* Make sure that the name of the computer in the cluster.properties file
  matches the name that appears in the blacklist exactly. If not, change the
  name in the cluster.properties file to match. For example: localhost:8188
  does not match 127.0.0.1:8188, and alpha4:80 does not match ALPHA4:80, which
  in turn does not match ALPHA4:0080!
7. ClusterClient or ArbiterServer is slow:
* Delete unused packages
* ‘Purge’ the cluster using the purge button in the ClusterClient servers
  view. Note: this will erase all information about packages older than 30
  days!
* Optimize the file system. The number of files in the ‘results’ folders might
  slow down the file system. You can speed up operation by disabling short
  filename generation with the following command:
fsutil behavior set disable8dot3 1
  Note: some older applications that use the short (8.3) filenames might stop
  working!
9. When I command the servers to eject the CD, the servers don't do that:
* Is the correct drive letter specified in the batch files eject.bat and
  load.bat on the servers in the \system folder?
10. Computer goes to stand-by and never wakes up again:
* The wake-on-LAN capability has to be enabled to wake up the computer after
  it was sent to stand-by by the arbiter. Try pinging the computer, connecting
  via remote desktop or otherwise disturbing the computer; if it does not come
  on, there is some problem with wake-on-LAN.
* Wake-on-LAN didn’t work with WinXP Service Pack 2 installed; when we
  uninstalled it and installed only SP1, the feature worked.
* BIOS or Windows settings might be disabling this feature
* If nothing helps, disable the go-to-standby feature on the computer by
  commenting out the STANDBY property in cluster.properties.
11. Arbiter cannot connect to JobServers / ClusterClient cannot connect to
    Arbiter:
* Check the Windows firewall settings. If needed, open ports 8188, 8189 and
  8191.
12. ClusterClient shuts down when starting Matlab. An error message is
    generated by the Java Virtual Machine saying that an exception occurred in
    the AWT package:
* We had an issue with an old ATI graphics card driver that caused this. Try
  installing the newest drivers (run Windows Update; in our case it helped).
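Regarding item 6 above (blacklist names that do not match), the following Matlab sketch shows why the listed name pairs count as different. It assumes that the cluster compares server names as plain, case-sensitive strings; this is only inferred from the examples in item 6, not taken from the cluster source. The variable names are hypothetical.

```matlab
% Name pairs taken from troubleshooting item 6
pairs = { 'localhost:8188', '127.0.0.1:8188'; ...
          'alpha4:80',      'ALPHA4:80'; ...
          'ALPHA4:80',      'ALPHA4:0080' };

% Assumption: names are compared as exact, case-sensitive strings
isSame = false(size(pairs, 1), 1);
for k = 1:size(pairs, 1)
    isSame(k) = strcmp(pairs{k, 1}, pairs{k, 2});
    fprintf('%-15s vs %-15s -> match: %d\n', ...
        pairs{k, 1}, pairs{k, 2}, isSame(k));
end
```

Under this assumption none of the pairs match, so the name in cluster.properties should be copied character for character from the blacklist entry.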
To do list: